Mikko Tolonen, Leo Lahti
June 12, 2015
Emphasis on research process
Transparency (data, methods, reporting)
Reproducibility
Openness (unlimited access and reuse)
New modes of collaboration and initiatives
Access to data is an institutional question. Using and tidying up the data is a research question.
Automation vs. point-and-click?
Load the data and tools in R:

```r
load("df.RData")
library(bibliographica)
```

Estimate missing dimensions
```r
kable(polish_dimensions("10 cm (12⁰)", fill = TRUE))
```

| original | gatherings | width | height | area |
|---|---|---|---|---|
| 10 cm (12⁰) | 12to | 10 | 15 | 150 |
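The logic behind filling in a missing dimension can be sketched as a lookup from the gatherings (fold) format to a typical height-to-width ratio. This is a toy illustration in Python, not the `bibliographica` implementation; the ratio table here is hypothetical (chosen so the `12to` example above reproduces width 10, height 15).

```python
# Hypothetical height/width ratios per gatherings format (illustration only).
TYPICAL_HEIGHT_TO_WIDTH = {"2fo": 1.6, "4to": 1.25, "8vo": 1.6, "12to": 1.5}

def fill_dimensions(gatherings, width=None, height=None):
    """Estimate the missing dimension from the gatherings format."""
    ratio = TYPICAL_HEIGHT_TO_WIDTH[gatherings]
    if height is None and width is not None:
        height = width * ratio          # estimate height from width
    elif width is None and height is not None:
        width = height / ratio          # estimate width from height
    return {"width": width, "height": height, "area": width * height}

fill_dimensions("12to", width=10)  # height estimated as 15.0, area 150.0
```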
Enriching the data with external information
```r
as.matrix(get_gender(polish_author(sample(unique(df$author.name), 20))$first)$gender)
```

```
##           [,1]
## hugh      "male"
## william   "male"
## daniel    "male"
## edward    "male"
## john      "male"
## william   "male"
## gilbert   "male"
## samuel    "male"
## henry     "male"
## levi      "male"
## zachary   "male"
## alexander "male"
## john      "male"
## anthony   "male"
## philippe  "male"
## david     "male"
## robert    "male"
## edward    "male"
## calybute  NA
## gilbert   "male"
```
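The idea behind this enrichment step is a first-name-to-gender lookup, with unknown names left as missing. A minimal sketch in Python with a toy lookup table (the R `get_gender()` relies on curated name-gender data, not this dictionary):

```python
# Toy name-gender table; real data would cover thousands of names.
NAME_GENDER = {"hugh": "male", "william": "male", "mary": "female", "anne": "female"}

def get_gender(first_names):
    # Unknown names map to None (cf. "calybute" -> NA in the output above).
    return [NAME_GENDER.get(name.lower()) for name in first_names]

get_gender(["Hugh", "Mary", "Calybute"])  # ['male', 'female', None]
```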
Authors (number of titles / paper use / life years)
Times 1470–1800?
Places: London, Ireland, Scotland, North America...?
Language?
Gender?
```r
top_plot(df, "author.unique", 20)
```

Document count vs. paper for top authors
```r
ggplot(df2, aes(x = docs, y = paper)) + geom_text(aes(label = author.unique), size = 4)
```

Gender distribution for authors over time. Note that the name-gender mappings change over time. This has not been taken into account yet.

```
##
## female  male
##  0.037  0.963
```
Publishing times 1470–1800?
Places: London, Ireland, Scotland, North America...?
Language?
```r
# Example filters; each line creates an alternative subset df2:
df2 <- df %>% filter(publication.place == "London")
df2 <- df %>% filter(language == "French")
df2 <- df %>% filter(publication.year >= 1700 & publication.year < 1800)
top_plot(df2, "author.unique", 10)
```

```r
top_plot(df, "publication.place", 10)
```

```r
df2 <- df %>% filter(publication.country %in% c("France", "Germany")) %>%
  group_by(publication.decade, publication.country) %>%
  summarize(paper = sum(paper.consumption.km2, na.rm = TRUE), docs = n())
```
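The group-and-summarize step above (sum paper consumption and count documents per decade and country) can be sketched language-agnostically; here in plain Python with a few hypothetical rows standing in for `df`:

```python
from collections import defaultdict

# Toy rows (hypothetical values) standing in for the bibliographic data frame.
records = [
    {"decade": 1700, "country": "France",  "paper": 0.25},
    {"decade": 1700, "country": "France",  "paper": 0.25},
    {"decade": 1700, "country": "Germany", "paper": 0.5},
]

# One accumulator per (decade, country) group, like group_by + summarize.
groups = defaultdict(lambda: {"paper": 0.0, "docs": 0})
for r in records:
    g = groups[(r["decade"], r["country"])]
    g["paper"] += r["paper"]   # sum(paper.consumption.km2)
    g["docs"] += 1             # n()

# groups[(1700, "France")] -> {"paper": 0.5, "docs": 2}
```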
```r
p <- ggplot(df2, aes(x = publication.decade, y = docs, color = publication.country)) +
  geom_point() + geom_smooth()
print(p)
```

```
## Warning in loop_apply(n, do.ply): span too small. fewer data values than
## degrees of freedom.
## Warning in loop_apply(n, do.ply): pseudoinverse used at 1599
## Warning in loop_apply(n, do.ply): neighborhood radius 180.95
## Warning in loop_apply(n, do.ply): reciprocal condition number 0
## Warning in loop_apply(n, do.ply): There are other near singularities as
## well. 29224
## Warning in loop_apply(n, do.ply): NaNs produced
```
| publication.place | paper (km²) | docs |
|---|---|---|
| London | 1.8523002 | 683 |
| Dublin | 0.2870011 | 75 |
| Edinburgh | 0.1851683 | 40 |
| Philadelphia Pa | 0.1192282 | 30 |
| Boston | 0.0146536 | 27 |
| Oxford | 0.0827044 | 22 |
| New York N.Y | 0.0031327 | 12 |
| unknown | 0.0100618 | 9 |
| Norwich | 0.0015466 | 6 |
| Amsterdam | 0.0036861 | 5 |
| Glasgow | 0.0161695 | 5 |
| Bristol | 0.0043038 | 4 |
| Paris | 0.0490283 | 4 |
| Providence R.I | 0.0001264 | 4 |
| Hartford Ct | 0.0005302 | 3 |
| Newport R.I | 0.0008837 | 3 |
| Norwich Ct | 0.0019448 | 3 |
| Watertown Ma | 0.0000000 | 3 |
| Williamsburg Va | 0.0006158 | 3 |
| Aberdeen | 0.0128302 | 2 |
| Boston Ma | 0.0049400 | 2 |
| Cambridge | 0.0011969 | 2 |
| Chester | 0.0008364 | 2 |
| New London Ct | 0.0001102 | 2 |
| Newburyport Ma | 0.0004750 | 2 |
| Newcastle | 0.0070300 | 2 |
| Salem Ma | 0.0016016 | 2 |
| Salisbury | 0.0012787 | 2 |
| St. Omer | 0.0017043 | 2 |
| York | 0.0003814 | 2 |
| Albany N.Y | 0.0007392 | 1 |
| Bennington Vt | 0.0022562 | 1 |
| Birmingham | 0.0005225 | 1 |
| Bombay | 0.0008100 | 1 |
| Bury | 0.0002964 | 1 |
| Calcutta | 0.0008645 | 1 |
| Cambridge Ma | 0.0002025 | 1 |
| Canterbury | 0.0008100 | 1 |
| Carmarthen | 0.0019950 | 1 |
| Charleston S.C | 0.0000640 | 1 |
| Cirencester | 0.0008100 | 1 |
| Cork | 0.0015808 | 1 |
| Coventry | 0.0001350 | 1 |
| Dunbar | 0.0002850 | 1 |
| Evesham | 0.0117040 | 1 |
| Exeter | 0.0001350 | 1 |
| Geneva | 0.0080769 | 1 |
| Gouda | 0.0004928 | 1 |
| Harrisburgh Pa | 0.0007362 | 1 |
| Hillsborough N.C | 0.0001350 | 1 |
| Hull | 0.0008100 | 1 |
| Leeds | 0.0043225 | 1 |
| Limerick | 0.0063726 | 1 |
| Litchfield Ct | 0.0000693 | 1 |
| Maidstone | 0.0004928 | 1 |
| New Bern N.C | 0.0014168 | 1 |
| Perth | 0.0003696 | 1 |
| Portsmouth N.H | 0.0004940 | 1 |
| Quebec | 0.0008768 | 1 |
| Twickenham | 0.0001232 | 1 |
| Vienna | 0.0000000 | 1 |
| Waltham | 0.0001350 | 1 |
| Washington D.C | 0.0002268 | 1 |
| Westminster Vt | 0.0000540 | 1 |
| Wilmington De | 0.0000693 | 1 |
| Winchester | 0.0008100 | 1 |
| Worcester | 0.0091784 | 1 |
```r
# Note: log10() inside aes() combined with scale_*_log10() applies the
# log transform twice; one of the two should be dropped.
ggplot(df2,
  aes(x = log10(1 + docs), y = log10(1 + paper))) +
  geom_text(aes(label = publication.place), size = 3) +
  scale_x_log10() + scale_y_log10()
```

Scotland, Ireland, US comparison:
```r
df2 <- df %>%
  filter(!is.na(publication.country)) %>%
  group_by(publication.country) %>%
  summarize(paper = sum(paper.consumption.km2, na.rm = TRUE),
            docs = n()) %>%
  arrange(desc(docs)) %>%
  filter(publication.country %in% c("Scotland", "Ireland", "USA"))
```

```r
library(reshape2)  # melt()
library(gridExtra) # grid.arrange()
p1 <- ggplot(subset(melt(df2), variable == "paper"),
             aes(y = value, x = publication.country)) +
  geom_bar(stat = "identity") + ylab("Paper consumption")
p2 <- ggplot(subset(melt(df2), variable == "docs"),
             aes(y = value, x = publication.country)) +
  geom_bar(stat = "identity") + ylab("Title count")
grid.arrange(p1, p2, nrow = 1)
```

What can we say about the nature of the documents? Pamphlets (<32 pages) vs. books (>120 pages)? Book size statistics and development over time.
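The pamphlet/book split suggested above reduces to a simple page-count rule. A sketch in Python using exactly the thresholds from the text (<32 pages: pamphlet; >120 pages: book); documents in between are left as an intermediate class:

```python
def classify_document(pagecount):
    """Classify a document by page count (thresholds from the text)."""
    if pagecount < 32:
        return "pamphlet"
    if pagecount > 120:
        return "book"
    return "intermediate"

[classify_document(p) for p in (8, 64, 300)]  # ['pamphlet', 'intermediate', 'book']
```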
Estimated paper consumption by document size
An estimated ~80% of statistical analysis consists of tidying up the data. This step is too often neglected and implicitly assumed by many tools; we provide new, efficient tools for it as well.
With open data principles, there is no need to reinvent the wheel for the same (or similar) datasets.
Things become stable: a research tool is corrected and perfected when it is transparent and potentially used by others.
The potential for reuse with similar datasets is great.
Automation allows reporting with minimal human intervention.
Innovative use of computational and statistical methods
New tools for old questions derived from the discipline itself
Vast amounts of useful data not being shared or utilized
Open access is not enough. We need open sharing of research data and methods to study "traditional" questions.
Institutions that hold the raw data are reluctant to give full access to it (even to researchers of the same institution). Why?
The research process is not opened up and research data is not shared in the Humanities. Transparency, reproduction, collaboration, and new initiatives are missing. Why?
Short answer: cultural change takes time. We need concrete examples in the core fields of the Humanities that show OPEN DATA PRINCIPLES are actually useful.